Single-cell RNA
scUnified: An AI-Ready Standardized Resource for Single-Cell RNA Sequencing Analysis
Xu, Ping, Wang, Zaitian, Wang, Zhirui, Li, Pengjiang, Zhang, Ran, Li, Gaoyang, Xie, Hanyu, Wang, Jiajia, Zhou, Yuanchun, Wang, Pengfei
Single-cell RNA sequencing (scRNA-seq) technology enables systematic delineation of cellular states and interactions, providing crucial insights into cellular heterogeneity. Building on this potential, numerous computational methods have been developed for tasks such as cell clustering, cell type annotation, and marker gene identification. To fully assess and compare these methods, standardized, analysis-ready datasets are essential. However, such datasets remain scarce, and variations in data formats, preprocessing workflows, and annotation strategies hinder reproducibility and complicate systematic evaluation of existing methods. To address these challenges, we present scUnified, an AI-ready standardized resource for single-cell RNA sequencing data that consolidates 13 high-quality datasets spanning two species (human and mouse) and nine tissue types. All datasets undergo standardized quality control and preprocessing and are stored in a uniform format to enable direct application in diverse computational analyses without additional data cleaning. We further demonstrate the utility of scUnified through experimental analyses of representative biological tasks, providing a reproducible foundation for the standardized evaluation of computational methods on a unified dataset.
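The abstract does not list the exact quality-control and preprocessing steps, so the following is only a minimal sketch of what a standardized pass over a raw count matrix commonly looks like: filter low-quality cells and rarely detected genes, then apply library-size normalization and a log transform. The function name `qc_and_normalize` and the thresholds `min_genes`, `min_cells`, and `target_sum` are hypothetical choices for illustration, not values taken from scUnified.

```python
import math


def qc_and_normalize(counts, min_genes=2, min_cells=2, target_sum=1e4):
    """Filter a cells-by-genes count matrix, then log-normalize it.

    counts: list of rows, one row per cell, one column per gene.
    All thresholds are illustrative assumptions, not scUnified settings.
    """
    # Keep cells in which at least `min_genes` genes have a nonzero count.
    kept_cells = [
        i for i, row in enumerate(counts)
        if sum(1 for c in row if c > 0) >= min_genes
    ]
    # Keep genes detected in at least `min_cells` of the retained cells.
    n_genes = len(counts[0]) if counts else 0
    kept_genes = [
        j for j in range(n_genes)
        if sum(1 for i in kept_cells if counts[i][j] > 0) >= min_cells
    ]
    normalized = []
    for i in kept_cells:
        row = [counts[i][j] for j in kept_genes]
        total = sum(row)
        # Scale each cell to the same total count, then apply log(1 + x).
        scale = target_sum / total if total > 0 else 0.0
        normalized.append([math.log1p(c * scale) for c in row])
    return kept_cells, kept_genes, normalized


raw = [[5, 0, 3], [0, 0, 1], [2, 0, 4]]
cells, genes, matrix = qc_and_normalize(raw)
```

Returning the indices of the retained cells and genes alongside the matrix keeps the filtering auditable, which matters when several methods are compared on the same preprocessed data.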
Contrastive inverse regression for dimension reduction
Hawke, Sam, Luo, Hengrui, Li, Didong
Supervised dimension reduction (SDR) has been a topic of growing interest in data science, as it enables the reduction of high-dimensional covariates while preserving the functional relation with certain response variables of interest. However, existing SDR methods are not suitable for analyzing datasets collected from case-control studies. In this setting, the goal is to learn and exploit the low-dimensional structure unique to or enriched by the case group, also known as the foreground group. While some unsupervised techniques such as the contrastive latent variable model and its variants have been developed for this purpose, they fail to preserve the functional relationship between the dimension-reduced covariates and the response variable. In this paper, we propose a supervised dimension reduction method called contrastive inverse regression (CIR) specifically designed for the contrastive setting. CIR introduces an optimization problem defined on the Stiefel manifold with a non-standard loss function. We prove the convergence of CIR to a local optimum using a gradient descent-based algorithm, and our numerical study empirically demonstrates the improved performance over competing methods for high-dimensional data.
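To make the phrase "gradient descent on the Stiefel manifold" concrete, here is a minimal sketch of the general technique. It does not use the CIR loss, which the abstract does not state; as a stand-in it minimizes f(V) = -trace(V^T A V) for a symmetric matrix A, subject to V^T V = I. Each step projects the Euclidean gradient onto the tangent space and then maps the iterate back onto the manifold with a Gram-Schmidt (QR-style) retraction. All function names, the step size, and the iteration count are assumptions made for illustration.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def orthonormalize(v):
    """Gram-Schmidt on the columns of v: a QR-style retraction onto the manifold."""
    cols = []
    for col in transpose(v):
        for q in cols:
            dot = sum(x * y for x, y in zip(col, q))
            col = [x - dot * y for x, y in zip(col, q)]
        norm = sum(x * x for x in col) ** 0.5
        cols.append([x / norm for x in col])
    return transpose(cols)


def objective(a, v):
    """Stand-in loss f(V) = -trace(V^T A V); NOT the CIR loss."""
    m = matmul(transpose(v), matmul(a, v))
    return -sum(m[i][i] for i in range(len(m)))


def stiefel_descent(a, v, step=0.1, iters=200):
    """Riemannian gradient descent for the stand-in loss, with A symmetric."""
    v = orthonormalize(v)
    for _ in range(iters):
        # Euclidean gradient of -trace(V^T A V) is -2 A V.
        g = [[-2.0 * x for x in row] for row in matmul(a, v)]
        # Tangent-space projection: G - V * sym(V^T G).
        vtg = matmul(transpose(v), g)
        d = len(vtg)
        sym = [[0.5 * (vtg[i][j] + vtg[j][i]) for j in range(d)] for i in range(d)]
        corr = matmul(v, sym)
        rgrad = [[x - y for x, y in zip(gr, cr)] for gr, cr in zip(g, corr)]
        # Take a step, then retract back onto the manifold.
        stepped = [[x - step * y for x, y in zip(vr, rr)] for vr, rr in zip(v, rgrad)]
        v = orthonormalize(stepped)
    return v


a_demo = [[3.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 1.0]]
v_demo = stiefel_descent(a_demo, [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

For this diagonal example the stand-in objective should approach -(3 + 2) = -5, the negated sum of the two largest eigenvalues. As with the convergence result described in the abstract, a procedure of this kind is only expected to reach a local optimum in general.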
AI predicts effective drug combinations to fight complex diseases faster
Finding new ways to repurpose or combine existing drugs has proved to be a powerful tool to treat complex diseases. Drugs used to treat one type of cancer, for instance, have effectively strengthened treatments for other cancer cells. Complex malignant tumors often require a combination of drugs, or "drug cocktails," to formulate a concerted attack on multiple cell types. Drug cocktails can not only help stave off drug resistance but also minimize harmful side effects. But finding an effective combination of existing drugs at the right dose is extremely challenging, partly because there are near-infinite possibilities.
Daily Digest September 16, 2019 – BioDecoded
Researchers benchmarked 22 classification methods that automatically assign cell identities, including single-cell-specific and general-purpose classifiers. The performance of the methods was evaluated using 27 publicly available single-cell RNA sequencing datasets of different sizes, technologies, species, and levels of complexity. The general-purpose support vector machine classifier had the best overall performance across the different experiments. Researchers present a novel algorithm for predicting genetic ancestry using only variables that are routinely captured in electronic health records (EHRs), such as self-reported race and ethnicity, and condition billing codes. Using patients who have both genetic and clinical information at Columbia University / New York-Presbyterian Irving Medical Center, they developed a pipeline that uses only clinical data to predict the genetic ancestry of all patients, more than 80% of whom identify as other or unknown.
Using artificial intelligence for error correction in single-cell RNA sequencing
The increased sensitivity of the technique, however, also means increased susceptibility to the batch effect. "The batch effect describes fluctuations between measurements that can occur, for example, if the temperature of the device deviates even slightly or the processing time of the cells changes," Maren Büttner explains. Although several models exist for the correction of these deviations, those methods are highly dependent on the actual magnitude of the effect. "We therefore developed a user-friendly, robust and sensitive measure called kBET that quantifies differences between experiments and therefore facilitates the comparison of different correction results," Büttner says.
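The idea of quantifying how strongly batches separate can be illustrated with a simplified nearest-neighbour test. This is a sketch of the general principle, not the published kBET implementation: for each cell it compares the batch composition of its k nearest neighbours with the global batch frequencies using a chi-squared statistic, and reports the fraction of cells for which the local composition is rejected as inconsistent. To stay within the standard library it handles exactly two batches, where the p-value with one degree of freedom equals erfc(sqrt(x / 2)). The function name `rejection_rate` and the defaults for `k` and `alpha` are assumptions.

```python
import math


def rejection_rate(points, batches, k=5, alpha=0.05):
    """Fraction of cells whose neighbourhood batch mix differs from the global mix.

    points: list of coordinate tuples; batches: one batch label per point.
    Simplified two-batch sketch; values near 0 suggest well-mixed batches.
    """
    labels = sorted(set(batches))
    if len(labels) != 2:
        raise ValueError("this sketch only handles exactly two batches")
    n = len(points)
    global_freq = {b: batches.count(b) / n for b in labels}
    rejected = 0
    for i, p in enumerate(points):
        # k nearest neighbours by Euclidean distance, excluding the cell itself.
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        neighbours = others[:k]
        # Pearson chi-squared statistic: observed vs expected batch counts.
        stat = 0.0
        for b in labels:
            observed = sum(1 for j in neighbours if batches[j] == b)
            expected = k * global_freq[b]
            stat += (observed - expected) ** 2 / expected
        # Chi-squared survival function for one degree of freedom.
        p_value = math.erfc(math.sqrt(stat / 2.0))
        if p_value < alpha:
            rejected += 1
    return rejected / n
```

Running the function before and after a batch-correction step, with the same `k` and `alpha`, gives two comparable numbers, which is the kind of side-by-side comparison of correction results the quote describes.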
A deep generative model for gene expression profiles from single-cell RNA sequencing
Lopez, Romain, Regier, Jeffrey, Cole, Michael, Jordan, Michael, Yosef, Nir
We propose a probabilistic model for interpreting gene expression levels that are observed through single-cell RNA sequencing. In the model, each cell has a low-dimensional latent representation. Additional latent variables account for technical effects that may erroneously set some observations of gene expression levels to zero. Conditional distributions are specified by neural networks, giving the proposed model enough flexibility to fit the data well. We use variational inference and stochastic optimization to approximate the posterior distribution. The inference procedure scales to over one million cells, whereas competing algorithms do not. Even for smaller datasets, for several tasks, the proposed procedure outperforms state-of-the-art methods like ZIFA and ZINB-WaVE. We also extend our framework to account for batch effects and other confounding factors, and propose a Bayesian hypothesis test for differential expression that outperforms DESeq2.
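One common way to model technical effects that set observed counts to zero is a zero-inflated negative binomial (ZINB) distribution. The abstract does not spell out the observation model, so treat the following as a sketch under that assumption rather than the authors' exact likelihood. Here `mu` is the mean expression, `theta` the inverse dispersion, and `pi` the probability of a technical zero; in a model like the one described, a neural network would output such parameters from the cell's latent representation, whereas this sketch takes them as plain numbers. The parameterization and function names are illustrative.

```python
import math


def nb_log_pmf(x, mu, theta):
    """Log-probability of count x under a negative binomial with mean mu > 0
    and inverse dispersion theta > 0."""
    return (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )


def zinb_log_pmf(x, mu, theta, pi):
    """Log-probability of count x under a zero-inflated negative binomial.

    pi in [0, 1) is the probability that the observation is a technical zero.
    """
    if x == 0:
        # A zero arises either from dropout or from the count model itself.
        return math.log(pi + (1.0 - pi) * math.exp(nb_log_pmf(0, mu, theta)))
    # A positive count can only come from the count model.
    return math.log(1.0 - pi) + nb_log_pmf(x, mu, theta)


def zinb_log_likelihood(counts, mu, theta, pi):
    """Sum of log-probabilities for one gene across independent cells."""
    return sum(zinb_log_pmf(x, mu, theta, pi) for x in counts)
```

Separating the two sources of zeros is what lets such a model distinguish a gene that is truly unexpressed from one that was simply not captured.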
A deep generative model for single-cell RNA sequencing with application to detecting differentially expressed genes
Lopez, Romain, Regier, Jeffrey, Cole, Michael, Jordan, Michael, Yosef, Nir
We propose a probabilistic model for interpreting gene expression levels that are observed through single-cell RNA sequencing. In the model, each cell has a low-dimensional latent representation. Additional latent variables account for technical effects that may erroneously set some observations of gene expression levels to zero. Conditional distributions are specified by neural networks, giving the proposed model enough flexibility to fit the data well. We use variational inference and stochastic optimization to approximate the posterior distribution. The inference procedure scales to over one million cells, whereas competing algorithms do not. Even for smaller datasets, for several tasks, the proposed procedure outperforms state-of-the-art methods like ZIFA and ZINB-WaVE. We also extend our framework to take into account batch effects and other confounding factors, and propose a natural Bayesian hypothesis framework for differential expression that outperforms traditional DESeq2.